Wozniak et al. BMC Bioinformatics 2014, 15:65
http://www.biomedcentral.com/1471-2105/15/65






METHODOLOGY ARTICLE Open Access



eCAMBer: efficient support for large-scale 
comparative analysis of multiple bacterial 
strains 

Michal Wozniak 1,2*, Limsoon Wong 2 and Jerzy Tiuryn 1



Abstract 

Background: Inconsistencies are often observed in the genome annotations of bacterial strains. Moreover, these 
inconsistencies are often not reflected by sequence discrepancies, but are caused by wrongly annotated gene starts 
as well as mis-identified gene presence. Thus, tools are needed for improving annotation consistency and accuracy 
among sets of bacterial strain genomes. 

Results: We have developed eCAMBer, a tool for efficiently supporting comparative analysis of multiple bacterial 
strains within the same species. eCAMBer is a highly optimized revision of our earlier tool, CAMBer, scaling it up for 
significantly larger datasets comprising hundreds of bacterial strains. eCAMBer works in two phases. First, it transfers 
gene annotations among all considered bacterial strains. In this phase, it also identifies homologous gene families and 
annotation inconsistencies. Second, eCAMBer tries to improve the quality of annotations by resolving the gene start
inconsistencies and filtering out gene families arising from annotation errors propagated in the previous phase. 

Conclusions: eCAMBer efficiently identifies and resolves annotation inconsistencies among closely related bacterial
genomes. It outperforms competing tools both in terms of running time and accuracy of produced annotations.
Software, user manual, and case study results are available at the project website:
http://bioputer.mimuw.edu.pl/ecamber.

Background

The number of bacterial genome sequences available in public databases is growing
rapidly, due to advances in high-throughput sequencing technologies [1]. For example,
from June 8, 2011 to February 12, 2014, the total number of whole-genome sequences
available in the PATRIC database grew from 3303 to 14114 [2]. By December 16, 2013,
there were 1452 whole-genome sequences of Escherichia coli and 435 whole-genome
sequences of Salmonella enterica strains available in the database.

Larger datasets of bacterial genome sequences enable new and interesting comparative
genome analyses [3-7]. However, it has been shown that a wide range of comparative
analyses (such as identification of overlapping genes and estimation of core genome
size) may be complicated or biased by the common inconsistencies in genome
annotations among closely related bacterial strains [8-13].

The observed inconsistencies are mostly of two types: mis-identification of gene
presence (false positive and false negative predictions are possible) and inconsistent
gene starts (or TISs, translation initiation sites). It has also been argued that most
of these inconsistencies are not reflected by sequence discrepancies, but arise as a
result of different annotation methodologies applied by different laboratories [10,14].
In fact, it has been shown that using the same tool to annotate a set of bacterial
genomes increases annotation consistency [10]. However, as we will observe later in
section "Annotation consistency", these annotation inconsistencies among closely
related genomes can arise even from annotations produced by the same annotation tool
or made by the same laboratory.

There is also an interesting question regarding TIS inconsistencies: can a bacterial
gene have multiple TISs? For example, it has been recently estimated, based on an
experimental study, that as many as 26.5% of genes in E. coli may have multiple
transcription start sites [15]; that may also suggest multiple TISs. Nevertheless, to
our knowledge, the existence of multiple real TISs in bacteria is not yet a confirmed
phenomenon. It should also be noted that there is only one TIS per gene in manually
curated annotations. Thus, in this study, we assume that each gene has only one
correct TIS.

Interestingly, the presence of annotation inconsistencies is an expected phenomenon
when single-genome prediction tools are applied independently. For example, suppose we
independently annotate $k = 20$ genomes, and assume that the missing gene error rate is
$e = 0.035$, which is the corresponding Prodigal [16] error rate estimated on the
E. coli dataset. Then, since $1 - (1 - e)^k \approx 0.51$, about 51% of core gene
families would be expected to have at least one missing gene annotation.
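
The estimate above treats a missed annotation in each genome as an independent event. As a quick check of the arithmetic, a minimal Python sketch (the function name is ours and purely illustrative) is:

```python
def fraction_with_missing_annotation(e: float, k: int) -> float:
    """Probability that a core gene family lacks an annotation in at least
    one of k independently annotated genomes, given per-genome miss rate e."""
    return 1.0 - (1.0 - e) ** k


# Values quoted in the text: e = 0.035 (Prodigal on E. coli), k = 20 genomes.
print(round(fraction_with_missing_annotation(0.035, 20), 2))  # 0.51
```

Under the same independence assumption, the expected fraction grows quickly with the number of strains, which is why the effect matters for datasets of hundreds of genomes.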

A promising idea to improve annotation accuracy by 
combining outputs of several single-genome annotation 
tools has been explored with a few proposed approaches 
[17-20]. However, these meta-approaches can be viewed 
as single-genome annotation tools. 

Recently, it has also been proposed that the accuracy of single-genome annotation
tools can be improved by comparative annotation among multiple genomes [21]. However,
even though there are many annotation tools dedicated to a single genome, there are
relatively few tools supporting comparative annotation and analysis of multiple
bacterial genomes [21]. Hence, there is a need to develop more tools to improve the
consistency of genome annotations across multiple bacterial strains.

Mugsy-Annotator is a tool which may assist in the curation of annotations of multiple
bacterial genomes by identifying annotation inconsistencies [22]. First, this tool
computes a whole-genome multiple alignment by employing Mugsy [23]. Then, based on
annotated gene coordinates mapped onto genomes in the multiple-genome alignment,
Mugsy-Annotator identifies orthologous gene families and annotation inconsistencies,
and proposes changes to the input annotations. Notably, Mugsy-Annotator does not make
any assumption about the reference strain. However, it suffers from quadratic time
complexity with respect to the number of strains, since in the first step it employs
Mugsy to compute pairwise all-against-all alignments of whole genomes.

Recently, two new majority voting-like approaches have been proposed to improve
annotation accuracy and consistency among multiple genomes: ORFcor [24] and GMV [25].
However, ORFcor requires a set of ortholog gene families to be supplied as the input,
and GMV is embedded within a pipeline which starts from input genome sequences and
genome annotations generated by Prodigal. It should also be noted that, since the GMV
pipeline uses BLAST in an all-against-all manner, it has quadratic time complexity
with respect to the number of strains.

In our previous work, we developed CAMBer [11], a tool conceptually similar to
Mugsy-Annotator and the GMV pipeline. It supports comparative analysis of multiple
bacterial strains. CAMBer unifies input gene annotations by homologous gene transfer
among all strains. Then, based on acceptable BLAST hits, it identifies orthologous
gene families. During this procedure annotation inconsistencies are identified. As
with Mugsy-Annotator and the GMV pipeline, it does not make any assumption about the
reference strain, and it has quadratic time complexity in the number of strains.
This property makes these tools scale poorly to large datasets.

Another notable tool which employs the idea of com- 
paring gene annotations among closely related genomes 
is GenePRIMP [26]. This tool identifies and reports gene 
annotation anomalies based on protein BLAST queries 
run against the NCBI nr database. These reports are 
helpful for manual curation of genome annotations. A 
similar feature has also been implemented in CAMBerVis 
[27] — our previously published tool for visualization and 
analysis of annotation inconsistencies. 

In this work, we present a new version of CAMBer, which we call eCAMBer (efficient
CAMBer). It also aims to identify annotation inconsistencies and orthologous gene
families. However, unlike Mugsy-Annotator and CAMBer, it achieves a significantly
better running time by taking advantage of working with highly similar genome
sequences. The speed-up offered by eCAMBer is most visible when working with a large
number of bacterial strains: for 41 strains of E. coli, the running time is reduced
from 2 days, in the case of CAMBer, to less than half an hour, in the case of
eCAMBer. Furthermore, eCAMBer tries to resolve annotation inconsistencies in order to
produce more accurate annotations. For this purpose, it implements a majority
voting-like approach for selecting the most reliable TISs and implements a procedure
for identification and removal of gene families which are likely to be propagated
annotation errors.

The concept of annotation may refer to many different aspects of attaching biological
information to genome sequences, such as identifying gene locations, assigning
functions to genes or assigning network context to gene products [14,28,29]. In this
work we focus on identifying locations of protein-coding genes. We use the term gene
annotation (or ORF annotation) to refer to the genome coordinates of a protein-coding
gene from its translation initiation site, TIS (alternatively called gene start), to
the nearest stop codon (alternatively called gene end). Note that each ORF annotation
is unambiguously determined by specifying the strand and position of its start codon.
Thus, we can use the term TIS annotation as a synonym for ORF annotation. We use this
term when multiple ORF annotations share the same stop codon.

Methods 

eCAMBer requires as its input a set of genome sequences and annotations for multiple
bacterial genomes. It should be noted, however, that eCAMBer supports automatic
download of bacterial annotations from the PATRIC [2] database and, as an option, it
allows the use of Prodigal to generate the input annotations. It works in two phases.
In the first phase it uses BLAST+ [30] to transfer each gene annotation among multiple
strains. Based on the results of this procedure, homologous multigene clusters are
identified. In the second phase eCAMBer successively applies the procedures for
refinement, TIS voting and clean up. Figure 1 presents a schematic view of these
successive procedures of eCAMBer.

Figure 1 Schematic view of subsequent procedures in eCAMBer. Boxes of the chart
represent the subsequent sets of annotations. Edges indicate application of eCAMBer
procedures to process these annotations. We call a set of ORF annotations multigene
annotations if multiple ORF annotations may share the same stop codon, indicating
possible starts of translation (TISs). We use the notion of a multigene to represent
multiple ORF annotations sharing the same stop codon.



The main improvements in eCAMBer as compared to 
CAMBer [11] are: 

• Significant speed up of the closure procedure for 
unifying genome annotations among bacterial strains; 

• Modified refinement procedure for splitting 
homologous gene families into orthologous gene 
clusters; 

• New TIS voting procedure for selecting the most 
reliable TIS; 

• New clean up procedure for removal of gene clusters 
that are likely to be gene annotation errors 
propagated during the closure procedure. 

Here, we describe the details of the procedures listed above. The default values for
the parameters introduced below were chosen arbitrarily. However, based on our
experiments, the program is robust to other reasonable choices of these parameters.
eCAMBer allows users to specify the values of all the parameters.

The closure procedure 

The closure procedure is the first step of eCAMBer. The input consists of genome
sequences and genome annotations for a set of closely related bacterial strains. In
this procedure gene annotations are iteratively transferred among the set of
considered strains, until no new ORFs (open reading frames) are identified. More
precisely, a gene annotation is transferred to a new location if its BLAST hit
extended to the nearest in-frame stop codon is acceptable. As in CAMBer, a BLAST hit
extension to the nearest stop codon is acceptable if it satisfies the following
conditions:

• The hit has one of the appropriate start codons: 
ATG, GTG, TTG, or the same start codon as in the 
query sequence; 

• The hit has its beginning aligned with the beginning 
of the query sequence; 

• The BLAST e-value score is below a given threshold $e_t$ (in the default setting
$e_t = 10^{-10}$);

• The ratio of the length of the extended hit to the query length is less than
$1 + p_t$ and greater than $1 - p_t$, where $p_t$ is a given threshold (in the
default setting $p_t = 0.2$);

• The percentage of identity of the hit (calculated as the number of identities
divided by the query sequence length, times 100) is above a length-dependent
threshold given by the adaptation of the HSSP curve introduced in our previous
work [11], defined by the parameter $n_t$ (in the default setting $n_t = 60.5$).
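
The acceptance conditions listed above can be read as a single predicate over an extended BLAST hit. The following Python sketch is only an illustration under stated assumptions: the `ExtendedHit` container and its field names are hypothetical, and the length-dependent identity threshold (the HSSP-curve adaptation from [11], which is not reproduced in this section) is passed in as a caller-supplied function rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable

START_CODONS = {"ATG", "GTG", "TTG"}


@dataclass
class ExtendedHit:
    # Hypothetical container; field names are illustrative, not eCAMBer's.
    hit_start_codon: str
    query_start_codon: str
    starts_aligned: bool    # hit begins aligned with the beginning of the query
    evalue: float
    extended_length: int    # hit length after extension to the nearest stop codon
    query_length: int
    identities: int


def is_acceptable(hit: ExtendedHit,
                  identity_threshold: Callable[[int], float],
                  e_t: float = 1e-10,
                  p_t: float = 0.2) -> bool:
    """Apply the five acceptance conditions in the order listed in the text."""
    if hit.hit_start_codon not in START_CODONS | {hit.query_start_codon}:
        return False
    if not hit.starts_aligned:
        return False
    if not hit.evalue < e_t:
        return False
    ratio = hit.extended_length / hit.query_length
    if not (1 - p_t < ratio < 1 + p_t):
        return False
    percent_identity = 100.0 * hit.identities / hit.query_length
    # identity_threshold stands in for the HSSP-curve adaptation (assumption).
    return percent_identity > identity_threshold(hit.query_length)
```

Keeping the identity threshold as a parameter makes the sketch usable with any length-dependent curve, at the cost of leaving the role of $n_t$ outside the code.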

In this procedure eCAMBer, unlike CAMBer, takes advantage of working with closely
related genomes. In contrast to the old approach, in each iteration, instead of using
each ORF sequence as a query, it first identifies groups of ORFs with exactly
identical sequences. This approach avoids using the same ORF sequence multiple
times as a BLAST query.
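
The grouping step can be pictured as building a map from each distinct sequence to the ORFs that carry it. This is a minimal sketch with made-up identifiers and toy sequences, not eCAMBer's actual data structures:

```python
from collections import defaultdict


def group_by_sequence(orfs):
    """Map each distinct ORF sequence to the list of ORF ids that carry it."""
    groups = defaultdict(list)
    for orf_id, sequence in orfs:
        groups[sequence].append(orf_id)
    return dict(groups)


# Toy input: two strains share one identical ORF sequence.
orfs = [("s1:g1", "ATGAAATAA"), ("s2:g7", "ATGAAATAA"), ("s2:g8", "ATGCCCTAA")]
groups = group_by_sequence(orfs)
print(len(orfs), "ORFs ->", len(groups), "BLAST queries")
```

Only the keys of the resulting map need to be submitted as queries; the values record which annotations each result applies to.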

The pseudocode for the closure procedure implemented in eCAMBer is given in
Algorithm 1, which we now describe in more detail. First, we start with the set of
input annotations $A_s^0$, for each strain $s$ in the set of considered strains $S$.
Each ORF annotation (or simply ORF) is defined by a tuple (start, end, strand, contig,
strain). Then, in the $i$th iteration we compute the set of BLAST queries $Q^i$ as the
set of distinct ORF sequences among all strains which have not been used as BLAST
queries yet. Next, we calculate in parallel, for each strain, BLAST results for all
sequence queries in $Q^i$. For each strain $s \in S$, all acceptable BLAST hit
extensions $H_s^i$ are added to the strain annotations, defining
$A_s^{i+1} \leftarrow A_s^i \cup H_s^i$. Next, the set $H^i$ of sequences identified
across all genomes in this iteration is computed, which is then used to update the set
of BLAST queries for the next iteration, $Q^{i+1} \leftarrow H^i \setminus D^i$, where
$D^i$ denotes the set of all distinct sequences before the $i$th iteration. The
procedure stops when no new ORF sequences are identified, that is, when
$Q^i = \emptyset$. For each strain $s \in S$, we denote by $A_s^c$ the set of
annotations produced by the closure procedure above. We further denote by $A^c$ the
set of all ORFs produced by the closure procedure.



Algorithm 1 The closure procedure (pseudocode)
Require: A set $S$ of bacterial strains; and for each $s \in S$, a set $A_s^0$ of
annotations, a set $G_s$ of sequences constituting the genome of $s$, and a mapping
function $\mathrm{sequences}_s(A)$ which returns the set of sequences in the genome
$G_s$ corresponding to the set of annotations $A$.
  $Q^0 \leftarrow D^0 \leftarrow \bigcup_{s \in S} \mathrm{sequences}_s(A_s^0)$
  $i \leftarrow 0$
  while $Q^i \neq \emptyset$ do
    for all $s \in S$ do
      $H_s^i \leftarrow$ acceptable BLAST hit extensions from $Q^i$ on genome $G_s$
      $A_s^{i+1} \leftarrow A_s^i \cup H_s^i$
    end for {The above operations are done in parallel for each $s \in S$. Also, for
    a query sequence $q \in Q^i$, if its BLAST hits are available in a database of
    precomputed BLAST results, eCAMBer takes results from the database instead.}
    $H^i \leftarrow \bigcup_{s \in S} \mathrm{sequences}_s(H_s^i)$
    $D^{i+1} \leftarrow D^i \cup H^i$
    $Q^{i+1} \leftarrow H^i \setminus D^i$
    $i \leftarrow i + 1$
  end while
  return annotations $A_s^i$, for all $s \in S$
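
To make the fixed-point structure of Algorithm 1 concrete, the loop can be sketched in Python as follows. This is an illustration under assumptions, not eCAMBer's implementation: `find_hits` is a caller-supplied placeholder standing in for BLAST plus the acceptance filter, `sequence_of` is a placeholder for the $\mathrm{sequences}_s$ mapping, and the per-strain loop is written sequentially although the text runs it in parallel.

```python
from typing import Callable, Dict, Set, Tuple

Orf = Tuple[int, int, str, str, str]  # (start, end, strand, contig, strain)


def closure(annotations: Dict[str, Set[Orf]],
            sequence_of: Callable[[Orf], str],
            find_hits: Callable[[Set[str], str], Set[Orf]]) -> Dict[str, Set[Orf]]:
    """Iterate annotation transfer until no new ORF sequence appears.

    find_hits(queries, strain) must return the acceptable hit extensions of
    the query sequences on that strain's genome (hypothetical helper).
    """
    annotations = {s: set(a) for s, a in annotations.items()}
    seen = {sequence_of(o) for a in annotations.values() for o in a}   # D^i
    queries = set(seen)                                                # Q^i
    while queries:
        hit_seqs: Set[str] = set()                                     # H^i
        for strain in annotations:   # independent per strain
            hits = find_hits(queries, strain)                          # H_s^i
            annotations[strain] |= hits                                # A_s^{i+1}
            hit_seqs |= {sequence_of(o) for o in hits}
        queries = hit_seqs - seen                                      # Q^{i+1}
        seen |= hit_seqs                                               # D^{i+1}
    return annotations
```

Because a sequence enters `queries` only if it has never been seen before, each distinct sequence is used as a query at most once, which is the property the text relies on for the speed-up.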



Here, we also recall the notion of a multigene, introduced in our previous work [11],
to account for the situation when multiple ORFs share the same stop codon in the
annotations produced during the closure procedure. These ORFs are called multigene
elements and represent putative gene translation units. Each multigene is represented
by a tuple (end, strand, contig, strain, elts), where elts is the set of ORFs
constituting the multigene. Also, for each strain $s \in S$, we denote by $M_s^c$ the
set of multigenes resulting from the closure procedure.
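
Since a multigene is fully determined by its stop codon position, strand, contig and strain, the multigenes of a strain can be obtained by grouping ORF tuples on those four fields. A minimal sketch, assuming ORFs are plain (start, end, strand, contig, strain) tuples as defined above (the function name is ours):

```python
from collections import defaultdict


def build_multigenes(orfs):
    """Group ORFs (start, end, strand, contig, strain) that share a stop codon.

    Returns a list of tuples (end, strand, contig, strain, elts).
    """
    by_stop = defaultdict(set)
    for orf in orfs:
        _start, end, strand, contig, strain = orf
        by_stop[(end, strand, contig, strain)].add(orf)
    return [key + (frozenset(elts),) for key, elts in by_stop.items()]
```

For example, two ORFs on the same strand that end at position 400 but start at positions 100 and 130 form one multigene with two elements, that is, two candidate TISs.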

Figure 2 presents a schematic view of the implementa- 
tion of the closure procedure in eCAMBer. 

The careful reader may also notice two important differences between the closure
procedures in CAMBer and eCAMBer. First, eCAMBer uses unique ORF sequences, rather
than ORF annotations, as queries against all strain genomes and, thus, does not repeat
a BLAST query when the same ORF sequence corresponds to multiple ORF annotations. In
contrast, CAMBer uses all ORF sequences as queries and, thus, may repeat a BLAST query
several times. Second, CAMBer BLASTs a query against all strains' genomes except that
of the strain from which the query is taken. The second difference may potentially
lead to different outcomes generated by these two approaches.

Since BLAST computations are the most time-consuming operation in each iteration of
the closure procedure, we express the time complexity of one iteration of the closure
procedure by the number of performed BLAST computations. Let $k = |S|$ denote the
number of considered strains and let $n = \max_{s \in S} |A_s^i|$ be the maximal
number of gene annotations per strain in iteration $i$. Let $d = |D^i|$ denote the
number of distinct gene sequences among all gene annotations in all considered
strains. Then, the time complexity of one iteration of the closure procedure
implemented in eCAMBer can be expressed as $O(d \cdot k)$, whereas it is
$O(n \cdot k^2)$ for CAMBer. Here, it should be noted that, potentially, if every
annotated ORF sequence in $S$ is different, then
$|D^i| = \sum_{s \in S} |A_s^i| = O(n \cdot k)$. However, as our case study
experiments show, $d$ is usually much smaller than $n \cdot k$ (see Figure 3).

Importantly, the number of I/O operations per iteration is also significantly
decreased, from $O(n \cdot k^2)$ in CAMBer to $O(k)$ in eCAMBer.

Consolidation graphs 

Having completed the closure procedure, we represent its results in the form of graph
structures, called consolidation graphs.

Figure 2 Schematic view of the closure procedure in eCAMBer. Annotations with
multiple ORF annotations sharing the same stop codon may be produced. This situation
gives rise to the notion of a multigene, which represents the set of ORFs sharing the
same stop codon. These multigene elements correspond to putative gene translation
units.

First, we introduce the conceptual representation, called the ORF consolidation
graph. In this graph $G_O = (V_O, E_O)$, each node $o \in V_O$ represents an ORF
annotation in $A_s^c$, for some $s \in S$. There is an undirected edge
$\{o_1, o_2\} \in E_O$ between a pair of ORFs if there is an acceptable BLAST hit from
the sequence of $o_1$ to $o_2$ or from the sequence of $o_2$ to $o_1$. We additionally
assume that there are no self-edges, i.e., $o_1 \neq o_2$.

Second, we recall the definition of the multigene consolidation graph, introduced in
our previous work [11]. In this graph $G_M = (V_M, E_M)$, each node $m \in V_M$
represents a multigene in $M_s^c$, for some $s \in S$. There is an undirected edge
$\{m_1, m_2\} \in E_M$ between a pair of multigenes if there is a pair of ORFs
$o_1 \in \mathrm{elts}(m_1)$ and $o_2 \in \mathrm{elts}(m_2)$ such that there is an
edge between them in the ORF consolidation graph (i.e., such that
$\{o_1, o_2\} \in E_O$).

Figure 3 Number of genes vs. number of distinct gene sequences. Comparison of the
number of distinct gene sequences to the total number of genes in original annotations
of 569 strains of E. coli. Strains were included cumulatively in the order of
increasing genome sizes. In the figure the x-axis corresponds to the number of strains
included.

Finally, we introduce the sequence consolidation graph, which is the structure used in
the implementation of eCAMBer, as it is a compact representation of the information
stored in the ORF consolidation graph and the multigene consolidation graph. In this
graph $G_S = (V_S, E_S, E_B)$, nodes represent distinct ORF sequences. There are two
types of edges: $E_B$, called BLAST-hit edges, and $E_S$, called shared-end edges.
There is an undirected shared-end edge $\{x, y\} \in E_S$ between a pair of sequence
nodes if there is a multigene having two elements with these sequences. There is an
undirected BLAST-hit edge $\{x, y\} \in E_B$ between a pair of sequence nodes if there
is an acceptable BLAST hit from $x$ to an ORF with sequence $y$, or if there is an
acceptable BLAST hit from $y$ to an ORF with sequence $x$.
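
As an illustration of the definition above, the sequence consolidation graph can be assembled from the multigenes and the list of acceptable hits. The sketch below makes several assumptions that are ours, not the paper's: multigenes are given simply as sets of ORFs, hits as (query sequence, hit ORF) pairs, and `sequence_of` is a hypothetical lookup from an ORF to its sequence.

```python
from itertools import combinations


def sequence_consolidation_graph(multigenes, blast_hits, sequence_of):
    """Return (nodes, shared_end_edges, blast_hit_edges) over distinct sequences.

    multigenes: iterable of sets of ORFs sharing a stop codon.
    blast_hits: iterable of (query_sequence, hit_orf) acceptable hits.
    """
    nodes, shared_end, blast_hit = set(), set(), set()
    for elts in multigenes:
        seqs = {sequence_of(o) for o in elts}
        nodes |= seqs
        # One shared-end edge per pair of distinct sequences in the multigene.
        shared_end |= {frozenset(pair) for pair in combinations(sorted(seqs), 2)}
    for query_seq, hit_orf in blast_hits:
        hit_seq = sequence_of(hit_orf)
        nodes |= {query_seq, hit_seq}
        # Identical sequences collapse into one node, so no edge is needed.
        if query_seq != hit_seq:
            blast_hit.add(frozenset((query_seq, hit_seq)))
    return nodes, shared_end, blast_hit
```

Because ORFs with identical sequences in different strains collapse into one node, the graph has at most $d$ nodes rather than one node per annotation.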

Figure 4 illustrates the correspondence between the 
ORF consolidation graph, sequence consolidation graph 
and the multigene consolidation graph. 

Homologous gene clusters 

The second step of eCAMBer is to determine homologous gene families as connected
components of the multigene consolidation graph $G_M$. There is a natural one-to-one
correspondence between the connected components of the multigene consolidation graph
and the connected components of the sequence consolidation graph (the latter connected
components are obtained by taking the union of $E_S$ and $E_B$). So, in eCAMBer, we do
this using connected components of the sequence consolidation graph $G_S$, because it
tends to be smaller for closely related genomes. The obtained set of homologous gene
families is represented as a set of disjoint multigene clusters, denoted by $C_M$.
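
Computing connected components over the union of the two edge sets is a standard task; one common way to do it is a union-find (disjoint-set) structure, sketched below. The choice of union-find is ours for illustration; the text does not say which algorithm eCAMBer uses.

```python
def connected_components(nodes, edges):
    """Connected components of an undirected graph via union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    components = {}
    for n in nodes:
        components.setdefault(find(n), set()).add(n)
    return list(components.values())


# Homologous families: components over shared-end and BLAST-hit edges together.
shared_end = {("a", "b")}
blast_hit = {("b", "c")}
families = connected_components({"a", "b", "c", "d"}, shared_end | blast_hit)
```

In this toy example the sequences a, b and c end up in one family and d in a family of its own.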

Refinement procedure 

Figure 4 Schematic view of the correspondence between different representations of
the closure procedure results in the form of consolidation graphs. A) The genomes with
marked ORF annotations. Round and square brackets represent the ORF start and stop
codons, respectively. Round brackets with stars indicate original TIS annotations,
whereas those without stars indicate the transferred TIS annotations; B) multigene
representation of the annotations with the ORF consolidation graph edges shown between
multigene elements; edges of the multigene consolidation graph are not shown for
readability; C) the sequence consolidation graph, in which nodes correspond to the
distinct ORF sequences; shared-end edges are drawn dashed, whereas BLAST-hit edges are
drawn solid.

The third step of eCAMBer is the refinement procedure. Its goal is to split the
homologous gene families, represented by multigene clusters, to obtain anchors. We
call a multigene cluster an anchor if it includes at most one multigene for every
strain. Analogously, we call a multigene cluster a non-anchor if there is a strain
which includes at least two multigenes in the cluster. Multigenes in the same anchor
are potentially orthologous to each other, whereas a non-anchor contains at least two
multigenes that are non-orthologous. Following CAMBer, we use genomic context
information to decompose non-anchors into smaller multigene clusters that can emerge
as anchors, as described below.
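
The anchor definition amounts to counting multigenes per strain within a cluster. A minimal sketch, assuming each multigene is represented as a (multigene id, strain) pair (a simplification of ours):

```python
from collections import Counter


def is_anchor(cluster):
    """A cluster is an anchor if no strain contributes more than one multigene."""
    per_strain = Counter(strain for _mid, strain in cluster)
    return all(count <= 1 for count in per_strain.values())


def split_anchors(clusters):
    """Partition clusters into (anchors, non_anchors), preserving order."""
    anchors = [c for c in clusters if is_anchor(c)]
    non_anchors = [c for c in clusters if not is_anchor(c)]
    return anchors, non_anchors
```

A cluster with two multigenes from the same strain is a non-anchor and becomes a candidate for splitting.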

The input for the refinement procedure consists of the set of multigene clusters
$C_M$, the sequence consolidation graph $G_S$, and the multigene annotations $M_s^c$,
for each strain $s \in S$. We start by classifying the set $C_M$ of multigene clusters
into two disjoint sets of anchors and non-anchors, denoted $C_A$ and $C_N$,
respectively. We also sort all multigenes within strain contigs by the positions of
their stop codons. We then reconstruct a subgraph of the multigene consolidation
graph, called the refinement graph. In this graph $G_R = (V_R, E_R)$, the node set
$V_R$ consists of the multigenes that belong to non-anchor clusters. There is an edge
$\{m_1, m_2\} \in E_R$ between a pair of multigenes coming from different strains if
there is an edge $\{m_1, m_2\} \in E_M$ and the two multigenes belong to the same
multigene cluster. By $E_R^{(s_1,s_2)}$ we denote the subset of edges between
multigenes from a pair of strains $s_1$ and $s_2$. We omit details of the
reconstruction of the refinement graph for brevity.

Then, for each unordered pair of strains $\{s_1, s_2\}$ we perform the following
procedure in parallel. First, for each multigene $m$ we identify its nearest left and
right neighbours that belong to anchors with a multigene element present on the
opposite strain. These left and right neighbours of $m$ are denoted
$l_m^{(s_1,s_2)}$ and $r_m^{(s_1,s_2)}$, respectively. Then, for each edge
$\{m_1, m_2\} \in E_R^{(s_1,s_2)}$ we check whether it is supported, in the sense that
it satisfies one of the following conditions: (i) it connects multigenes belonging to
a cluster such that $m_1$ and $m_2$ are its only elements in strains $s_1$ and $s_2$;
(ii) the neighbours in each of the pairs
$(l_{m_1}^{(s_1,s_2)}, l_{m_2}^{(s_1,s_2)})$ and
$(r_{m_1}^{(s_1,s_2)}, r_{m_2}^{(s_1,s_2)})$ belong to the same anchor; (iii) the
neighbours in each of the pairs $(l_{m_1}^{(s_1,s_2)}, r_{m_2}^{(s_1,s_2)})$ and
$(r_{m_1}^{(s_1,s_2)}, l_{m_2}^{(s_1,s_2)})$ belong to the same anchor. If any of the
four neighbours does not exist, we substitute it with a dummy node, which virtually
belongs to any anchor.

Finally, we obtain the refined graph $G_R^*$ by removing the unsupported edges from
$G_R$. The set of connected components $C_R$ of $G_R^*$ then defines the set of
multigene clusters after the split, and we update the set of multigene clusters as
$C_M^* \leftarrow (C_M \setminus C_N) \cup C_R$.
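
The edge-support test at the heart of the refinement procedure can be sketched as follows. This is one possible reading of conditions (i) to (iii), written under assumptions of ours: `left` and `right` are hypothetical maps from a multigene to its nearest anchor neighbour (missing entries play the role of the dummy node), `anchor_of` maps a neighbour to the identifier of its anchor, and `only_pair` tells whether the two multigenes are the only elements of their cluster in the two strains.

```python
def same_anchor(a, b, anchor_of):
    """True if a and b lie in the same anchor; None is the dummy node."""
    if a is None or b is None:      # a dummy node belongs to any anchor
        return True
    return anchor_of[a] == anchor_of[b]


def is_supported(m1, m2, left, right, anchor_of, only_pair=False):
    """Check conditions (i)-(iii) for the edge {m1, m2} between two strains."""
    if only_pair:                                           # condition (i)
        return True
    l1, r1 = left.get(m1), right.get(m1)
    l2, r2 = left.get(m2), right.get(m2)
    same_order = (same_anchor(l1, l2, anchor_of)
                  and same_anchor(r1, r2, anchor_of))       # condition (ii)
    flipped = (same_anchor(l1, r2, anchor_of)
               and same_anchor(r1, l2, anchor_of))          # condition (iii)
    return same_order or flipped
```

Condition (iii) accepts neighbourhoods that match in reversed order, which would cover a locus that is inverted in one of the two strains. Removing every edge for which `is_supported` is false and recomputing connected components then yields the split clusters.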

The careful reader may also notice the differences between the refinement procedures
implemented in CAMBer and eCAMBer. First, the refinement procedure in CAMBer iterates
until no multigene clusters can be split. In eCAMBer the refinement procedure consists
of only one iteration. However, since the input and output of the procedure are of the
same type, it can be applied multiple times, until no new clusters are split. Second,
the condition for an edge to be supported in eCAMBer is more relaxed than that in
CAMBer. Both approaches, for a pair of multigenes on different strains, identify pairs
of their nearest left and right neighbour multigenes (belonging to anchor clusters
with elements on both strains). However, CAMBer checks the actual presence of edges
between the neighbours, whereas eCAMBer only checks if the identified neighbours match
the same pair of clusters. This approach allows eCAMBer to avoid a costly
reconstruction of the whole multigene consolidation graph.

TIS voting procedure 

The fourth step of eCAMBer is the TIS voting procedure. 
The goal of the TIS voting procedure is to select the most 
reliable TIS for each multigene. To do this we implement 
an approach based on the concept of majority voting. This 
strategy has also been used to improve genome annotation 
accuracy in several recent studies [24,31]. 

In this procedure, for each multigene $m$ in each multigene cluster $c \in C_M^*$ we
try to find a TIS (originally annotated or transferred) that belongs to a connected
component of the ORF consolidation graph, where the connected component satisfies the
following two conditions: (i) it has TISs (originally annotated or transferred)
present in at least 80% of the multigenes in $c$; and (ii) it has originally annotated
TISs in at least 50% of the multigenes in $c$, or it has originally annotated TISs in
at least twice as many multigenes in $c$ as all other connected components in $c$. If
such a TIS is found, it is selected as the TIS for $m$. If such a TIS is not found,
but $m$ has an originally annotated TIS, then the originally annotated TIS is selected
as the TIS for $m$. If neither of these two cases applies, the TIS corresponding to
the longest ORF in the multigene $m$ is selected. After the TIS voting procedure,
every multigene has exactly one TIS selected. Thus, we obtain an unambiguous TIS
annotation for every gene.
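
The three-level selection rule can be sketched in Python as follows. The data layout and several tie-breaking choices are assumptions of ours and are flagged in the comments: each candidate TIS is a tuple (tis, component, is_original, orf_length); "all other connected components" is read as "each other component"; a component with no original annotations is never eligible; and if several eligible components offer a TIS for the same multigene, the one with the most original annotations wins.

```python
def vote_tis(cluster):
    """cluster: {multigene_id: [(tis, component, is_original, orf_length), ...]}.

    Returns {multigene_id: selected_tis}.
    """
    n = len(cluster)
    present, original = {}, {}          # component -> set of multigene ids
    for m, candidates in cluster.items():
        for _tis, comp, is_orig, _length in candidates:
            present.setdefault(comp, set()).add(m)
            if is_orig:
                original.setdefault(comp, set()).add(m)

    def n_orig(comp):
        return len(original.get(comp, ()))

    def eligible(comp):
        best_other = max((n_orig(c) for c in present if c != comp), default=0)
        enough_present = len(present[comp]) >= 0.8 * n               # (i)
        enough_original = (n_orig(comp) >= 0.5 * n                   # (ii)
                           or (n_orig(comp) > 0                      # guard: our assumption
                               and n_orig(comp) >= 2 * best_other))
        return enough_present and enough_original

    selected = {}
    for m, candidates in cluster.items():
        voted = [t for t in candidates if eligible(t[1])]
        originals = [t for t in candidates if t[2]]
        if voted:
            choice = max(voted, key=lambda t: n_orig(t[1]))   # tie-break: assumption
        elif originals:
            choice = originals[0]
        else:
            choice = max(candidates, key=lambda t: t[3])      # longest ORF
        selected[m] = choice[0]
    return selected
```

In a cluster of three multigenes where two carry an original TIS in component A and the third has a transferred TIS in A plus an original TIS in a singleton component B, the vote moves the third multigene to the TIS in A.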

Note that the connected components of the sequence 
consolidation graph — after shared-end edges have been 
removed — are in a natural one-to-one correspondence 
with the connected components in the ORF consolidation 
graph. So in eCAMBer, we implement the TIS voting pro- 
cedure using the sequence consolidation graph, as it tends 
to be smaller for closely related genomes. 

Clean up procedure 

The last step of eCAMBer is the clean up procedure, which is designed to filter out
multigene clusters which are likely due to gene annotation errors propagated during
the closure procedure.

The input for this procedure consists of the set of multigene clusters and the
multigene annotations $M_s^c$, for each strain $s \in S$. For each multigene cluster
$c \in C_M^*$ we compute the following features: (i) $l$, the median multigene length
in $c$; (ii) $p$, the ratio of the number of strains with at least one element from
$c$ to the total number of strains; (iii) $r$, the ratio of the number of strains with
at least one originally annotated multigene to the total number of strains with at
least one element from $c$; (iv) $v$, the ratio of the number of multigenes in the
cluster that are overlapped by a longer multigene to the total number of multigenes in
the cluster.

Then, we update the set of multigene clusters $C_M^*$ by removing the multigene
clusters for which ($p <$ | or $r <$ |) and ($l < 150$ or $v > 0.5$).
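
The four features and the removal rule translate directly into code. In the sketch below the record layout (dictionaries with `strain`, `length`, `original` and `overlapped` keys) is hypothetical, and the cut-offs for $p$ and $r$ are deliberately left as required arguments (`p_min` and `r_min`, our names) rather than given defaults; only the values 150 and 0.5 are taken from the text.

```python
from statistics import median


def cluster_features(cluster, n_strains):
    """Compute (l, p, r, v) for one multigene cluster.

    cluster: list of dicts with keys 'strain', 'length', 'original' (bool) and
    'overlapped' (bool: the multigene is overlapped by a longer multigene).
    """
    strains = {m["strain"] for m in cluster}
    original_strains = {m["strain"] for m in cluster if m["original"]}
    l = median(m["length"] for m in cluster)
    p = len(strains) / n_strains
    r = len(original_strains) / len(strains)
    v = sum(m["overlapped"] for m in cluster) / len(cluster)
    return l, p, r, v


def should_remove(cluster, n_strains, p_min, r_min, l_min=150, v_max=0.5):
    """Removal rule: (p < p_min or r < r_min) and (l < l_min or v > v_max)."""
    l, p, r, v = cluster_features(cluster, n_strains)
    return (p < p_min or r < r_min) and (l < l_min or v > v_max)
```

The rule is a conjunction: a cluster is dropped only if it is both poorly supported across strains (low $p$ or $r$) and suspicious in itself (short or mostly overlapped), so a short but widely annotated gene family is kept.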

Other features 

In order to make eCAMBer more user-friendly we have added functionality for
downloading genome sequences and genome annotations from the PATRIC database for a
set of selected strains within a species. The downloaded data is automatically
formatted as input for eCAMBer. Additionally, eCAMBer integrates Prodigal to generate
input gene annotations.

Furthermore, eCAMBer generates output compatible with CAMBerVis [27], a tool for
simultaneous visualization of multiple genome annotations of bacterial strains.
CAMBerVis also handles visualization of genome annotation inconsistencies.

Results and discussion 

In this section we present the results of our experiments, 
which demonstrate that: (i) eCAMBer is much more effi- 
cient than CAMBer, Mugsy-Annotator and the GMV 
pipeline; (ii) it scales well to large datasets; (iii) it improves 
annotation consistency; (iv) it improves annotation accu- 
racy; and (v) eCAMBer outperforms Mugsy-Annotator 
and the GMV pipeline in terms of accuracy. 

Comparison of running times 

First, we compare the efficiency of eCAMBer and CAM- 
Ber by running the closure procedure for both tools on 
four datasets from our previous work on CAMBer [11]. 
All computations in this experiment were performed on 
the same desktop machine with 4 processor cores being 
used. In this experiment eCAMBer significantly outper- 
forms CAMBer (Table 1). For example, the running time 
on 9 strains of M. tuberculosis was reduced from about 1 
hour 22 minutes to only 42 seconds. 

Second, we also compare the running time of eCAM- 
Ber against CAMBer, Mugsy-Annotator and the GMV 
pipeline by running them on the four datasets from our 



Table 1 eCAMBer vs. CAMBer 

CAMBer eCAMBer 

Dataset BLASTs closure BLASTs closure 

2 strains of S. aureus 1m 47 s 2m5s 8s 18s 

9 strains of M. tuberculosis 1 h 22 m 1 h 27 m 27 s 41s 

22 strains of S. aureus 6h 6.5 h 3m15s 4m 

41 strains of E.coli 42 h 48.5 h 22 m 25 m 



previous work on CAMBer [11]. Since Mugsy-Annotator 
does not support multiple thread processing, in this 
experiment we use only one processor core for the compu- 
tations. Table 2 presents running times in this experiment. 
It is clear from this table that the running time speedup 
achieved by eCAMBer is much more pronounced for 
larger datasets. This is an expected phenomenon since the 
other tools have quadratic running times with respect to 
the number of strains included. 

The above results also suggest that eCAMBer scales well 
to larger datasets. 
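The speedups implied by Table 2 can be checked with a short calculation. The following is a minimal sketch, not part of eCAMBer itself: the helper names `to_seconds` and `speedup_over` are hypothetical, and the running times are simply copied from Table 2.

```python
import re

# Single-core running times copied from Table 2.
TABLE2 = {
    "2 strains of S. aureus": {
        "CAMBer": "7 m 31 s", "eCAMBer": "26 s",
        "Mugsy-Annotator": "2 m", "GMV": "21 m"},
    "9 strains of M. tuberculosis": {
        "CAMBer": "4 h 12 m", "eCAMBer": "2 m 37 s",
        "Mugsy-Annotator": "1 h 25 m", "GMV": "13 h 53 m"},
    "22 strains of S. aureus": {
        "CAMBer": "37 h 5 m", "eCAMBer": "16 m 30 s",
        "Mugsy-Annotator": "4 h 11 m", "GMV": "28 h 36 m"},
    "41 strains of E. coli": {
        "CAMBer": "273 h 22 m", "eCAMBer": "1 h 48 m",
        "Mugsy-Annotator": "19 h 21 m", "GMV": "368 h 31 m"},
}

UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def to_seconds(text):
    """Convert a duration such as '1 h 48 m' or '2m37s' to seconds."""
    parts = re.findall(r"(\d+(?:\.\d+)?)\s*([hms])", text)
    return sum(float(value) * UNIT_SECONDS[unit] for value, unit in parts)


def speedup_over(table, baseline="eCAMBer"):
    """For each dataset, divide every tool's time by the baseline tool's time."""
    result = {}
    for dataset, times in table.items():
        base = to_seconds(times[baseline])
        result[dataset] = {
            tool: to_seconds(duration) / base
            for tool, duration in times.items()
            if tool != baseline
        }
    return result


if __name__ == "__main__":
    for dataset, ratios in speedup_over(TABLE2).items():
        formatted = ", ".join(f"{tool}: {ratio:.1f}x" for tool, ratio in ratios.items())
        print(f"{dataset}: {formatted}")
```

Each ratio states how many times longer the other tool ran than eCAMBer on the same dataset and the same single core.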

Large case studies 

We examine the scalability of eCAMBer to large datasets 
by running it on 10 datasets for the 10 species with 
the highest number of sequenced strains in the PATRIC 
database [2], in the 16 March 2013 release. All datasets 
consist of genome sequences and annotations for the sets 
of strains within the same species. Experiments for all 
of these datasets were conducted on a machine with 24 
processor cores, out of which 20 were used. 

Table 3 shows a distribution of running times of all pro- 
cedures of eCAMBer. The reader may observe that the 
running times are not necessarily monotonically increas- 
ing with the number of strains. For example, the closure 



Table 2 Comparison of running times for different tools

Dataset                      | CAMBer     | eCAMBer   | Mugsy-Ann. | GMV
2 strains of S. aureus       | 7 m 31 s   | 26 s      | 2 m        | 21 m
9 strains of M. tuberculosis | 4 h 12 m   | 2 m 37 s  | 1 h 25 m   | 13 h 53 m
22 strains of S. aureus      | 37 h 5 m   | 16 m 30 s | 4 h 11 m   | 28 h 36 m
41 strains of E. coli        | 273 h 22 m | 1 h 48 m  | 19 h 21 m  | 368 h 31 m



Comparison of running times between eCAMBer, CAMBer, Mugsy-Annotator
and the GMV pipeline on four datasets from our previous work on CAMBer. All
computations were executed on a machine with 1 processor core being used.
The machine used in this computational experiment was different from the one
used in the previous experiment. Columns correspond, in left-to-right order, to:
short dataset description, total time consumed by the closure procedure in
CAMBer, total time consumed by eCAMBer, total time consumed by
Mugsy-Annotator, total time consumed by the GMV pipeline.
Table 3 eCAMBer on large datasets

Columns 1 to 4: dataset description. Columns 5 to 9: running times.

Species name   | Strains      | Genes        | Distinct seq. | Closure      | Graph        | Refine.  | TISv.    | Clean up
E. coli        | 569          | 2923165      | 487141 (0.17) | 12 h         | 59 m         | 2 h 51 m | 14 m     | 10 m
S. enterica    | [illegible]  | [illegible]  | [illegible]   | [illegible]  | [illegible]  | 30 m     | 4 m      | 4 m
S. agalactiae  | 250          | 517648       | 56215 (0.11)  | 29 m         | 2 m          | 5 m      | 37 s     | 53 s
S. pneumoniae  | 238          | 529076       | 99578 (0.19)  | 2 h 29 m     | 5 m          | 9 m      | 1 m 30 s | 1 m 10 s
S. aureus      | 195          | 523557       | 98562 (0.19)  | 1 h 7 m      | 3 m          | 4 m      | 1 m 50 s | 1 m
H. pylori      | 163          | 267302       | 208790 (0.78) | 1 h 42 m     | 12 m         | 5 m      | 5 m 10 s | 2 m 10 s
L. interrogans | 139          | 649916       | 175899 (0.27) | 1 h 30 m     | 4 m          | 7 m      | 1 m 30 s | 1 m 50 s
V. cholerae    | 130          | 467413       | 97258 (0.21)  | 24 m         | 2 m          | 2 m 20 s | 35 s     | 51 s
A. baumannii   | 131          | 487775       | 129089 (0.27) | 34 m         | 3 m          | 2 m 30 s | 52 s     | 58 s
B. cereus      | 104          | 602986       | 395477 (0.66) | 1 h 13 m     | 6 m          | 3 m 50 s | 2 m 57 s | 1 m 52 s



Running times of eCAMBer on the 10 large datasets. All experiments were performed on the same machine with 24 processor cores, where 20 of them were used. The 
columns correspond in left-to-right order to: the species name, the number of sequenced strains within the species, the total number of annotated genes, the number 
of distinct sequences for the set of annotated genes (in the brackets we also provide the ratio between the number of distinct sequences to the total number of 
annotated genes), running time to compute all BLASTS for the closure procedure, total running time to compute the closure procedure (including BLAST 
computations), the running time to construct the sequence consolidation graph, the running time to compute the refinement procedure, the running time for the TIS 
voting procedure, and the running time for the clean up procedure. 



procedure computations for the dataset of 162 strains of 
H. pylori took longer than the larger dataset of 195 strains 
of S. aureus. This may be explained by the fact that the 
total number of distinct sequences for annotated genes 
in S. aureus (98562) is much smaller than in H. pylori 
(208790). 

In order to further investigate the scalability of eCAM- 
Ber, we check how the number of distinct gene sequences 
increases, when more strains are included. For this exper- 
iment, we chose the largest dataset of 569 strains of 
E. coli. Next, we sorted all genomes from the smallest to 
the largest. The plots (Figure 3) present the number of 
annotated genes and the number of gene sequences in a 
cumulative manner. We observe that the total number of 
distinct sequences grows much slower than the total num- 
ber of gene annotations, suggesting sub-linear growth of 
the number of distinct gene sequences. Thus, according 
to our theoretical considerations, the algorithm imple- 
mented in eCAMBer for computing the closure procedure 
is sub-quadratic with respect to the number of strains 
included. 
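The cumulative comparison described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `cumulative_counts` is hypothetical, each strain is represented as a plain list of gene sequences, and strains are ordered by their number of annotated genes as a stand-in for the genome-size ordering used in the experiment.

```python
def cumulative_counts(strains):
    """Return (total annotations, distinct sequences) after adding each strain.

    strains: list of lists, one inner list of gene sequences per strain.
    Assumption: strains are processed from the fewest to the most annotated
    genes, approximating the smallest-to-largest genome ordering.
    """
    seen = set()      # distinct gene sequences observed so far
    total = 0         # all gene annotations observed so far
    curve = []
    for genes in sorted(strains, key=len):
        total += len(genes)
        seen.update(genes)
        curve.append((total, len(seen)))
    return curve


if __name__ == "__main__":
    toy = [
        ["ATGAAATAA", "ATGCCCTAA"],
        ["ATGAAATAA"],
        ["ATGAAATAA", "ATGGGGTAA", "ATGCCCTAA"],
    ]
    for total, distinct in cumulative_counts(toy):
        print(total, distinct)
```

When many strains share identical gene sequences, the second value in each pair grows more slowly than the first, which is the behaviour the plots in Figure 3 are meant to show.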

This experiment also shows that the strategy applied in 
eCAMBer to work with unique ORF sequences, rather 
than ORF annotations, leads to a sequence consolidation 
graph that is significantly smaller than the corresponding
ORF consolidation graph. For example, in the largest
dataset for 569 strains of E. coli, there are about 12.4 million
nodes (ORF annotations) and 2.8 billion edges in the ORF
consolidation graph, whereas there are only about 1.6 million
nodes (unique ORF sequences), 1.3 million shared-end edges,
and 55.9 million BLAST-hit edges in the sequence consolidation
graph.



Annotation consistency 

We also investigate the ability of eCAMBer to identify annotation
inconsistencies and to improve the consistency of
annotations. As a case study, we use the set of 20 E. coli 
strains with manually curated annotations, deposited in 
the ColiScope database [5], available through the web- 
based interface MaGe [32]. Pseudogenes were excluded 
from the analysis. On this dataset we run the closure 
procedure, followed by: the refinement procedure, the 
TIS voting procedure, and the clean up procedure. For 
comparison we also include annotations for the same set 
of strains, but downloaded from the PATRIC database 
[2]. 

In order to assess the improvement of annotation con- 
sistency, after running eCAMBer, we calculated the mean 
absolute difference in the number of annotated multi- 
genes between two neighbour strains. It is 311 for the 
original annotations from ColiScope vs. 159 after apply- 
ing eCAMBer. Analogous statistics on the dataset from 
PATRIC are 409 for the original annotations and 311 after 
applying eCAMBer. 
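The consistency statistic used above can be written down explicitly. The sketch below is an assumption-laden illustration: the function name is hypothetical, and "neighbour strains" is interpreted here as strains that are adjacent in some fixed ordering supplied by the caller, which this section does not define precisely.

```python
def mean_abs_neighbour_difference(multigene_counts):
    """Mean absolute difference between counts of adjacent strains.

    multigene_counts: number of annotated multigenes per strain, listed in
    the strain ordering chosen by the caller (an assumption of this sketch).
    """
    if len(multigene_counts) < 2:
        raise ValueError("at least two strains are needed")
    differences = [
        abs(left - right)
        for left, right in zip(multigene_counts, multigene_counts[1:])
    ]
    return sum(differences) / len(differences)


if __name__ == "__main__":
    # Toy counts, not taken from the ColiScope or PATRIC datasets.
    print(mean_abs_neighbour_difference([4300, 4500, 4450]))
```

A lower value means that adjacent strains carry more similar numbers of annotated multigenes, which is how the drop from 311 to 159 is read in the text.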

In the dataset of 20 E. coli strains from the ColiScope
database, after the closure procedure, eCAMBer identifies
73 gene families which have the following property: each
family has a member in every strain, and for each family
exactly one strain has a missing original annotation in
that family. The top three strains with the highest number
of missing gene annotations of that type are: Sd197
(13), 2a 2457T (8) and 536 (7). The most well-studied
strain K-12 MG1655 has four missing annotations of the 
above described type. These annotations were added by 
eCAMBer during the closure procedure. 
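The selection criterion for these families can be expressed as a small filter. This is a sketch with a hypothetical data layout, not eCAMBer's internal representation: each family maps a strain name to a flag saying whether the member was present in the original annotation (`True`) or added by the closure procedure (`False`).

```python
from collections import Counter


def single_missing_annotation(families, strains):
    """Count, per strain, families where it is the only strain lacking
    an original annotation.

    families: dict of family id -> dict of strain -> bool, where True means
              the member was originally annotated and False means it was
              added during closure (hypothetical layout).
    strains:  iterable with the names of all strains in the dataset.
    """
    all_strains = set(strains)
    missing = Counter()
    for members in families.values():
        # The family must have a member in every strain.
        if set(members) != all_strains:
            continue
        lacking = [strain for strain, original in members.items() if not original]
        # Exactly one strain must lack the original annotation.
        if len(lacking) == 1:
            missing[lacking[0]] += 1
    return missing


if __name__ == "__main__":
    strains = ["A", "B", "C"]
    families = {
        "f1": {"A": True, "B": True, "C": False},   # counted for C
        "f2": {"A": True, "B": False, "C": False},  # two strains missing
        "f3": {"A": True, "B": True},               # not present in C
        "f4": {"A": True, "B": True, "C": False},   # counted for C
    }
    print(single_missing_annotation(families, strains).most_common())
```

Sorting the resulting counter by value gives a ranking of strains by number of such missing annotations, analogous to the top-three list reported above.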
Based on this case study, we also investigate how eCAMBer
improves consistency of TISs. There are 8038 pairs of
originally annotated genes with different TISs, but with
identical sequence (including the 100 bp upstream region
from the TIS of the longer annotation). This number was
reduced to 4230 after applying the TIS majority voting
procedure and the clean up procedure.
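The idea of majority voting over TISs can be illustrated with a deliberately simplified sketch. It is not the TIS voting procedure implemented in eCAMBer: the function name, the representation of a TIS as an integer offset, and the rule of leaving ties unresolved are all assumptions made for this example.

```python
from collections import Counter


def vote_tis(annotated_starts):
    """Pick the most frequently annotated TIS among genes with identical sequence.

    annotated_starts: list of TIS positions (for example, offsets from the
    shared stop codon), one per annotated gene in the group.
    Returns the winning TIS, or None when there is no input or the top two
    candidates are tied (assumption: ties are left unchanged).
    """
    if not annotated_starts:
        return None
    ranked = Counter(annotated_starts).most_common()
    best, best_votes = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == best_votes:
        return None
    return best


if __name__ == "__main__":
    print(vote_tis([120, 120, 87]))  # the start annotated twice wins
    print(vote_tis([120, 87]))       # tie, so no decision is made
```

Applying such a rule to every group of identical sequences with conflicting starts reduces the number of inconsistent pairs, which is the effect measured in the paragraph above.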

This case study also shows that inconsistencies, which 
come from annotation errors, are present even for a very 
well-studied bacterial organism like E. coli. Note also 
that the discussed annotation inconsistencies were identi- 
fied among strains with annotations curated by the same 
laboratory. 

Comparison of features of eCAMBer and other tools 

CAMBer, eCAMBer, Mugsy-Annotator and the GMV 
pipeline aim to improve annotation consistency and 
accuracy. But there are some important differences 
between these approaches and their features (Table 4). For 
example, CAMBer and Mugsy-Annotator require gene 
annotations to be provided, whereas the GMV pipeline 
generates the input annotations using Prodigal and there 
is no straightforward way to substitute these annotations 
with any other. Thus, all computational experiments
involving the GMV pipeline were run only on Prodigal
annotations. eCAMBer also integrates Prodigal as a tool 
to generate input annotations; however, it also allows the 
user to provide any other annotations as the input. All the 
tools require genome sequences at the input. 

Different tools also aim to solve different annotation
problems. For example, the GMV pipeline only identi- 
fies and solves TIS annotation inconsistencies, whereas 
Mugsy-Annotator also tries to identify missing genes. Our 
new tool, eCAMBer, is capable of resolving TIS incon- 
sistencies, as well as removal of overannotated genes and 
addition of missing genes (Table 4). Our previous tool 
only identifies annotation inconsistencies, but it does not 
propose corrections. 



Support for multithreading is a valuable feature for 
computationally demanding problems. Thus, it should be 
noted that eCAMBer has the most comprehensive sup- 
port for multithreading among the tools considered. It 
allows the use of multiple threads for each of its steps. 
The GMV pipeline and CAMBer support multithread- 
ing only for BLAST computations. Mugsy-Annotator does 
not support it (Table 4). 

Evaluation of annotation accuracy 

In order to evaluate accuracy of annotations produced 
by eCAMBer, Mugsy-Annotator and the GMV pipeline, 
we apply the tools to annotations produced by the auto- 
matic annotation pipeline in PATRIC [2] for the set of 20 
E. coli strains with manually curated annotations in the 
ColiScope database [5]. As an alternative dataset of input 
annotations for the same set of strains we use annotations 
generated using Prodigal [16]. 

In all our comparative experiments we run Mugsy- 
Annotator and the GMV pipeline with default parameters. 
It should also be mentioned that both Mugsy-Annotator 
and the GMV pipeline output lists of suggestions of 
changes to input annotations, rather than actually output 
the corrected annotations. We post-processed these pro- 
posed lists of changes to generate the output annotations 
used for the comparative experiments. 

First we assess the correctness of the changes intro- 
duced to the input annotations based on the dataset of 
gene annotations with experimental support available in 
the EcoGene 3 database [31]. This dataset consists of 922
gene annotations for the K-12 MG1655 strain. From this
set we excluded four genes: fdhF, prfB, rph' and insN', since
their sequences corresponding to the annotated coordinates
are disrupted (the length of the sequence from
the start codon to the stop codon is not divisible by
3). Additionally, we ran one iteration of the eCAMBer
closure procedure to transfer the set of 918 gene annotations
to the remaining 19 strains. The transferred gene



Table 4 Qualitative comparison of different tools

Feature                                    | CAMBer  | eCAMBer             | Mugsy-Annotator | GMV
Input data                                 | GS, GA  | GS, optional GA     | GS, GA          | GS
Mapping of similar sequences               | BLAST   | BLAST               | Multiple WGA    | BLAST
Detection of gene presence inconsistencies | Yes     | Yes                 | Yes             | No
Detection of gene start inconsistencies    | Yes     | Yes                 | Yes             | Yes
Correction of gene presence annotations    | No      | Yes (add. and rem.) | Yes (only add.) | No
Correction of gene start annotations       | No      | Yes                 | Yes             | Yes
Multithreading                             | Partial | Yes                 | No              | Partial



Qualitative comparison of different tools. Columns correspond to the tools, whereas rows correspond to different qualitative features of these tools. Acronyms "GS"
and "GA" denote genome sequences and genome annotations, respectively. Acronym "WGA" stands for whole genome alignment. Both CAMBer and the GMV 
pipeline have partial support for multithreading computations since only BLAST computations can be executed in parallel. 
annotations share at least 80% of sequence identity with
original annotations for strain K-12 MG1655. 

Table 5 presents statistics for the TIS changes intro- 
duced by different tools compared against the dataset 
described above. There are three different scenarios: (i) 
a correct TIS annotation is changed to an incorrect one 
(orange); (ii) an incorrect TIS annotation is changed to 
another incorrect TIS (yellow); (iii) an incorrect TIS is 
changed to the correct one (green). Since for each gene, 
there is only one TIS annotation considered as correct, 
there is no possible change from one correct TIS to 
another one. For each strain the majority of TIS changes 
introduced by eCAMBer is correct. In this experiment 
eCAMBer made 89 TIS changes from incorrect to cor- 
rect and only 12 TIS changes from correct to incorrect 
on the dataset of Prodigal annotations. For comparison, 
GMV made 47 incorrect to correct TIS changes and 8 cor- 
rect to incorrect TIS changes, on the same dataset. Thus, 
the number of correct TIS annotations has increased by 77 
in case of eCAMBer and by 39 in case of GMV. Applica- 
tion of Mugsy-Annotator made more wrong changes than 
correct. Additional file 1 shows panel figures for results of 
eCAMBer, Mugsy-Annotator and GMV on both PATRIC 
and Prodigal annotations. 
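The three scenarios and the resulting net gain can be made concrete with a short sketch. The function names and the representation of a TIS as an integer position are hypothetical; only the counts used in the usage example (89 and 12 for eCAMBer, 47 and 8 for GMV, on Prodigal annotations) come from the text.

```python
def classify_tis_change(old_tis, new_tis, reference_tis):
    """Classify a TIS change against a single reference (gold standard) TIS.

    Returns one of 'incorrect->correct', 'correct->incorrect',
    'incorrect->incorrect', or None when the TIS was not changed.
    """
    if old_tis == new_tis:
        return None
    if new_tis == reference_tis:
        return "incorrect->correct"
    if old_tis == reference_tis:
        return "correct->incorrect"
    return "incorrect->incorrect"


def net_gain(incorrect_to_correct, correct_to_incorrect):
    """Net increase in the number of correct TIS annotations."""
    return incorrect_to_correct - correct_to_incorrect


if __name__ == "__main__":
    print(classify_tis_change(old_tis=150, new_tis=120, reference_tis=120))
    print(net_gain(89, 12))  # eCAMBer on Prodigal annotations
    print(net_gain(47, 8))   # GMV on Prodigal annotations
```

Changes between two incorrect TISs do not affect the net gain, which is why only two of the three counts enter the subtraction.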

Since the extended dataset of annotations from EcoGene
3 constitutes only about 20% of all genes in the 20 strains
of E. coli, it is not sufficient for direct assessment of the overall
quality of changes introduced by eCAMBer and other
tools. In particular, we cannot conclude whether a gene annotation
is correct based on its absence from this dataset
(that is, when no gene annotation in the dataset shares
the same stop codon). Thus, we perform a further assessment
of the quality of changes introduced, relying on manually
curated annotations for the set of 20 E. coli strains
in the ColiScope dataset [5]. It is a reasonable choice as a



Table 5 Overall statistics for TIS changes

Statistic                          | PATRIC: MA | PATRIC: eCAMBer | Prodigal: GMV | Prodigal: MA | Prodigal: eCAMBer
# of incorrect→correct TIS changes   | 839        | 392             | 47            | 132          | 89
# of incorrect→incorrect TIS changes | 215        | 50              | 5             | 96           | 8
# of correct→incorrect TIS changes   | 892        | 92              | 8             | 672          | 12



Overall statistics for TIS changes introduced by eCAMBer, Mugsy-Annotator
(MA) and the GMV pipeline. The tools were run on the dataset of 20 E. coli strains with
annotations from the PATRIC database (columns 2 to 3) and generated using
Prodigal (columns 4 to 6). Correctness of the changes introduced was assessed
by comparing them against the set of experimentally verified gene annotations
available in the EcoGene 3 database for the K-12 MG1655 strain. Gold standard
annotations for the remaining 19 strains were obtained by homology transfer of
that set of 918 annotations. Statistics presented in this table include only that
subset of genes which share the same stop codon as any of the genes in the
gold standard.



gold standard, since many of the annotations have experi- 
mental support. In particular, the annotation for the strain 
K-12 MG1655 contains 901 out of 918 gene annotations
present in the dataset described previously. For compari- 
son, for this strain, there are only 841 and 883 such gene 
annotations for PATRIC and Prodigal, respectively. 

Next, Figure 5 presents the assessment of TIS changes 
introduced during the TIS voting procedure based on the 
ColiScope dataset. It shows the assessment of the TIS 
changes introduced to the input PATRIC annotations, 
with respect to each of the 20 E. coli strains. The statistics
presented in this figure distinguish three different
scenarios: (i) a correct TIS annotation is changed to an 
incorrect one (orange); (ii) an incorrect TIS annotation is 
changed to another incorrect TIS (yellow); (iii) an incor- 
rect TIS is changed to the correct one (green). Since for 
each gene, there is only one TIS annotation considered 
as correct, there is no possible change from one correct 
TIS to another one. For each strain the majority of TIS 
changes introduced by eCAMBer is correct. Additional 
file 2 shows analogous panel figures for results of eCAM- 
Ber, Mugsy-Annotator and GMV on both PATRIC and 
Prodigal annotations. Rows 5 to 8 of Table 6 summarize 
the overall impact of eCAMBer and Mugsy-Annotator 
on TIS annotations. Remarkably, 70% (1591 out of 2260) 
of TIS changes introduced by eCAMBer to PATRIC 
annotations were correct. For comparison, only 43% of 
the TIS changes introduced by Mugsy-Annotator were 
correct. 

Figure 6 presents the assessment of gene additions and 
removals introduced during the closure and the clean up 
procedures, respectively. It shows the assessment of the 
changes introduced to the input PATRIC annotations, 
with respect to each of the 20 E. coli strains. The statistics
presented in this figure distinguish four different scenarios:
(i) a missing gene annotation is correctly added
during the closure procedure (blue); (ii) a wrong gene 
annotation is correctly removed during the clean up pro- 
cedure (green); (iii) a wrong gene annotation is incorrectly 
added during the closure procedure (red); and (iv) a cor- 
rect gene annotation is incorrectly removed during the 
clean up procedure (orange). It can be seen that, for each 
strain, the majority of changes introduced by eCAM- 
Ber is correct. Additional file 3 shows analogous panel 
figures for results of Mugsy-Annotator and eCAMBer on 
both PATRIC and Prodigal annotations. The first four 
rows of Table 6 summarize the overall impact of eCAM- 
Ber and Mugsy-Annotator on gene presence. The results 
show that eCAMBer outperforms Mugsy-Annotator in 
this aspect. For example, 70% of the changes introduced by 
eCAMBer to PATRIC annotations were correct, whereas 
it was only 26% for Mugsy-Annotator. 
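The percentages quoted above follow directly from the gene addition and removal counts in Table 6. The sketch below reproduces the arithmetic; the function name `fraction_correct` is hypothetical, and the counts are copied from the PATRIC columns of that table.

```python
def fraction_correct(correctly_added, correctly_removed,
                     incorrectly_added, incorrectly_removed):
    """Share of gene presence changes (additions and removals) that were correct."""
    correct = correctly_added + correctly_removed
    total = correct + incorrectly_added + incorrectly_removed
    if total == 0:
        raise ValueError("no changes were introduced")
    return correct / total


if __name__ == "__main__":
    # Counts for PATRIC input annotations, taken from Table 6.
    ecamber = fraction_correct(701, 3993, 792, 1224)
    mugsy = fraction_correct(410, 0, 1177, 0)
    print(f"eCAMBer: {ecamber:.0%}, Mugsy-Annotator: {mugsy:.0%}")
```

Mugsy-Annotator only adds genes, so its two removal counts are zero and its ratio reduces to correctly added genes over all added genes.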

Finally, we investigate how the whole pipelines imple- 
mented in eCAMBer, Mugsy-Annotator and GMV 




Figure 5 Statistics for TIS voting procedure. Impact of the TIS voting procedure of eCAMBer on annotations from the PATRIC database.
Annotations from the ColiScope database were used to assess correctness of TIS changes. Note that, since there is only one TIS
annotation considered correct for each gene, no change from one correct TIS to another is possible.



improve the overall annotation accuracy. Here, the accuracy
is measured by the F1 statistic, defined as

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}},$$

where $\mathrm{precision} = \frac{TP}{TP + FP}$ and $\mathrm{recall} = \frac{TP}{TP + FN}$. Here,
TP, FP and FN denote true positive, false positive and
false negative predictions, respectively. Since a pair of gene
annotations may have the same stop codon, but different
TISs, we keep track of the results for both stop codon
predictions and TIS predictions.
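The accuracy measure can be computed as follows. This is a minimal sketch with hypothetical function names; the usage example plugs in the precision and recall reported in Table 6 for the Prodigal input annotations (gene starts) and for eCAMBer on PATRIC annotations (gene starts).

```python
def precision(tp, fp):
    """Fraction of predicted annotations that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of reference annotations that were recovered: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision p and recall r."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


if __name__ == "__main__":
    # Precision and recall values taken from Table 6 (gene starts).
    print(round(f1_score(0.764, 0.752), 3))  # Prodigal input annotations
    print(round(f1_score(0.699, 0.703), 3))  # eCAMBer on PATRIC annotations
```

Because a TIS prediction can only be a true positive when its stop codon is also correct, the same functions applied to TIS predictions yield values no higher than those for stop codons.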

Results of eCAMBer on PATRIC annotations in this
experiment are presented in Figure 7. Note that each correctly
identified TIS also determines its correctly identified
stop codon, but not the other way round. Thus,
the accuracy for the TIS prediction is lower than for the 
stop codons. As the figure shows, eCAMBer improves 
annotation accuracy, for each strain, both in terms of TIS 
annotations and stop codon annotations. Additional file 4 
shows analogous panel figures for results of eCAMBer, 
Mugsy-Annotator and GMV on both PATRIC and Prodi- 
gal annotations. Rows 9 and 12 of Table 6 summarize 
the change in accuracy when running different tools on 
PATRIC and Prodigal annotations. It is clear from this 



Table 6 Overall accuracy statistics for different tools

Statistic                            | PATRIC: Input | PATRIC: MA | PATRIC: eCAMBer | Prodigal: Input | Prodigal: GMV | Prodigal: MA | Prodigal: eCAMBer
# of incorrectly removed genes       | NA            | 0          | 1224            | NA              | 0             | 0            | 388
# of incorrectly added genes         | NA            | 1177       | 792             | NA              | 0             | 344          | 331
# of correctly removed genes         | NA            | 0          | 3993            | NA              | 0             | 0            | 1185
# of correctly added genes           | NA            | 410        | 701             | NA              | 0             | 210          | 1447
# of incorrect→correct TIS changes   | NA            | 4812       | 1591            | NA              | 149           | 1015         | 290
# of incorrect→incorrect TIS changes | NA            | 2223       | 747             | NA              | 28            | 1018         | 113
# of correct→incorrect TIS changes   | NA            | 4279       | 669             | NA              | 78            | 3618         | 170
Precision for gene starts            | 0.665         | 0.663      | 0.699           | 0.764           | 0.764         | 0.734        | 0.775
Recall for gene starts               | 0.695         | 0.702      | 0.703           | 0.752           | 0.753         | 0.727        | 0.765
F1 for gene starts                   | 0.680         | 0.682      | 0.701           | 0.758           | 0.759         | 0.731        | 0.770
Precision for gene ends              | 0.892         | 0.882      | 0.920           | 0.931           | 0.931         | 0.928        | 0.940
Recall for gene ends                 | 0.931         | 0.935      | 0.926           | 0.917           | 0.917         | 0.919        | 0.927
F1 for gene ends                     | 0.911         | 0.908      | 0.923           | 0.924           | 0.924         | 0.923        | 0.934



Overall statistics for accuracy of changes introduced by eCAMBer, Mugsy-Annotator (MA) and the GMV pipeline. The tools were run on the dataset of 20 E. coli with 
annotations from the PATRIC database (columns 2 to 4) and generated using Prodigal (columns 5 to 8). Correctness of the changes introduced was assessed by 
comparison with annotations from the ColiScope database. Columns labelled "Input" correspond to the original annotations. "NA" stands for not applicable. Rows correspond to
different statistics of running each tool. 
Figure 6 Statistics for closure and clean up procedures. Impact of the closure and clean up procedures of eCAMBer on the annotations from the
PATRIC database. Annotations from the ColiScope database were used to assess correctness of gene removals and additions introduced by eCAMBer. 



table that eCAMBer outperforms other tools. For example,
eCAMBer increased the F1 statistic of initial annotations
of Prodigal (for gene starts) from 0.758 to 0.770,
whereas the application of GMV improved it only by
0.001 and the application of Mugsy-Annotator decreased
it by 0.027. In the case of PATRIC annotations, application
of Mugsy-Annotator improved the accuracy from
0.680 to 0.682. However, the accuracy of annotations after
eCAMBer increased to 0.701.

Conclusions 

We have developed eCAMBer, a tool for supporting large- 
scale comparative analysis of multiple bacterial strains. 
eCAMBer identifies and resolves annotation inconsisten- 
cies among closely related bacterial genomes. 

This tool works in two phases. First, it tries to transfer 
gene annotations among all considered bacterial strains. 



In this procedure, called closure, it also identifies homol- 
ogous gene families and annotation inconsistencies. The 
underlying idea behind the efficient implementation of 
the procedure is to avoid redundant BLAST queries. This 
approach greatly reduces the computational complexity, 
thus leading to much shorter running time than other 
tools. For example, on the dataset of 41 strains of E. coli, 
computations took less than two hours (using only one 
processing thread), whereas Mugsy-Annotator (the fastest 
competitor) took more than 19 hours. Moreover, eCAM- 
Ber supports multithreading for all its procedures. This 
allows eCAMBer to be used on much larger datasets 
comprising hundreds of bacterial strains. 
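The principle of avoiding redundant queries can be illustrated with a small caching sketch. It is not eCAMBer's closure implementation: the function name and data layout are hypothetical, and `run_query` is a placeholder callable standing in for an actual BLAST search.

```python
def query_once_per_sequence(annotations, run_query):
    """Run one query per distinct sequence and share the result among duplicates.

    annotations: iterable of (strain, gene_id, sequence) tuples.
    run_query:   hypothetical callable taking a sequence and returning hits;
                 it stands in for a BLAST search in this sketch.
    Returns (results per annotation, number of queries actually executed).
    """
    cache = {}    # sequence -> hits, filled on first use
    results = {}
    for strain, gene_id, sequence in annotations:
        if sequence not in cache:
            cache[sequence] = run_query(sequence)
        results[(strain, gene_id)] = cache[sequence]
    return results, len(cache)


if __name__ == "__main__":
    annotations = [
        ("strain1", "g1", "ATGAAATAA"),
        ("strain2", "g7", "ATGAAATAA"),  # identical sequence, no new query
        ("strain2", "g8", "ATGCCCTAA"),
    ]
    results, executed = query_once_per_sequence(annotations, lambda seq: len(seq))
    print(executed, "queries for", len(annotations), "annotations")
```

The saving grows with the share of annotations whose sequence has already been seen, which is why the ratio of distinct sequences to annotated genes in Table 3 matters for running time.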

An idea, called compressive genomics, has recently been 
proposed with new approaches to optimize BLAST search 
time of sequence databases [33,34]. However, one sig- 
nificant conceptual difference, between these methods 




Figure 7 Comparison of annotation accuracy before and after applying eCAMBer. Comparison of annotation accuracy before and after 
applying eCAMBer on the dataset of 20 E. coli strains with annotations from PATRIC. Manually curated annotations from ColiScope were used as a 
gold standard. 
and the closure procedure in eCAMBer, is that these
approaches try to reduce the size of the target database, 
whereas the eCAMBer closure procedure reduces the 
redundancy among BLAST queries. It may be interesting, 
for further research, to combine these ideas. 

In the second phase, eCAMBer applies a majority 
voting-like approach, in the procedure called TIS voting, 
to choose the most reliable TIS for each gene. Finally, 
it removes possible gene annotation errors during the 
clean up procedure. Our case study experiments show 
that, in these steps, eCAMBer improves the quality of ini- 
tial annotations generated with automatic pipelines, such 
as PATRIC or Prodigal. For example, the application of 
eCAMBer to PATRIC annotations performed 1575 TIS 
changes, out of which 1183 (75%) were correct. 

Moreover, eCAMBer outperforms its competitors, 
Mugsy-Annotator and the GMV pipeline, in terms of 
improving quality of annotations. In particular, when run 
on Prodigal annotations for the set of 20 E. coli strains, 
eCAMBer increased the F1 statistic of initial annotations
(for gene starts) from 0.758 to 0.770, whereas the application of GMV
improved it only by 0.001 and the application of Mugsy- 
Annotator even decreased it. 

Finally, eCAMBer also has some limitations. One is 
that it purely relies on the quality of original annota- 
tions. Thus, for example, eCAMBer cannot identify genes, 
whose annotations are missing for all strains. Another lim- 
itation of eCAMBer is that pseudogenes and non-protein 
coding genes are excluded from the analysis. This follows 
from the assumption that eCAMBer considers only genes 
that start with a start codon, end with a stop codon, and have
a length divisible by 3.
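The assumption stated above can be written as a simple predicate. This is a sketch only: the function name is hypothetical, and the set of start codons (ATG, GTG, TTG) is an assumption about commonly used bacterial start codons rather than a list taken from eCAMBer.

```python
# Assumption: commonly used bacterial start codons; eCAMBer's own list may differ.
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_considered_gene(sequence):
    """Check that a coding sequence starts with a start codon, ends with a
    stop codon, and has a length divisible by 3."""
    sequence = sequence.upper()
    return (
        len(sequence) >= 6
        and len(sequence) % 3 == 0
        and sequence[:3] in START_CODONS
        and sequence[-3:] in STOP_CODONS
    )


if __name__ == "__main__":
    print(is_considered_gene("ATGAAACCCTAA"))  # well-formed ORF
    print(is_considered_gene("ATGAAACCTAA"))   # length not divisible by 3
```

Sequences failing this check, such as frameshifted pseudogenes or RNA genes, fall outside the analysis, which is the limitation described in the paragraph above.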
Additional file 1: Assessment of the correctness of TIS changes based
on EcoGene 3.0. Comparison of the impact of applying eCAMBer,
Mugsy-Annotator and the GMV pipeline on the quality of TIS annotations.
The experiment was run on the dataset of 20 E. coli strains with
annotations downloaded from PATRIC and generated using Prodigal.
Correctness of changes introduced was assessed by comparison with the
set of annotations downloaded from the EcoGene 3 database for the K-12
MG1655 strain plus transferred annotations for the 19 remaining strains.

Additional file 2: Assessment of the correctness of TIS changes based 
on ColiScope. Comparison of the impact of applying eCAMBer, 
Mugsy-Annotator and the GMV pipeline on the quality of TIS annotations. 
The experiment was run on the dataset of 20 E. coli strains with 
annotations downloaded from PATRIC and generated using Prodigal. 
Correctness of changes introduced was assessed by comparison with 
annotations in the ColiScope database. 

Additional file 3: Assessment of the correctness of gene removals 
and additions. Comparison of the impact of applying eCAMBer, 
Mugsy-Annotator and the GMV pipeline on the quality of gene ends 
annotations. The experiment was run on the dataset of 20 E. coli strains
with annotations downloaded from PATRIC and generated using Prodigal. 
Correctness of changes introduced was assessed by comparison with 
annotations in the ColiScope database. 



Additional file 4: Accuracy: eCAMBer vs. other tools. Comparison of 
the impact of applying eCAMBer, Mugsy-Annotator and the GMV pipeline 
on annotation accuracy. To assess the accuracy, the F1 statistic was used. The
experiment was run on the dataset of 20 E. coli strains with annotations
downloaded from PATRIC and generated using Prodigal. Correctness of 
changes introduced was assessed by comparison with annotations in the 
ColiScope database. 